refactor(dialect): extract shared RegexSafety and SqlEmitters helpers #31
Merged
The five RE2-style dialects each carried a byte-identical copy of the ReDoS-safety validation (length/group/nesting limits, nested-quantifier and quantified-alternation detection, and the three helper methods), and the six dialect implementations duplicated several SQL-fragment emitters (EXTRACT, array join, regex match, JSON path probe, json_each membership, binary-function split, arrow JSON access).

Consolidate both into two focused helpers in the dialect package:

- RegexSafety: checkLength + checkReDoS plus the shared limits/patterns. Also normalizes the rejection message: MySQL/BigQuery previously emitted "Invalid regex pattern" while the others emitted "Invalid pattern in expression"; all dialects now use the latter consistently.
- SqlEmitters: writeBinaryCall, writeArrayJoin, writeJsonEachMembership, writeJsonPathProbe, writeInfixRegex, writeStandardExtract / writeExtractWithPostgresDow, and writeArrowJsonAccess. Per-dialect field-name escaping is threaded through as a method reference so BigQuery's distinct escaping is preserved.

Each dialect keeps its own override and delegates the body; dialects whose output genuinely differs keep their inline implementation.

Net ~878 fewer lines across the 11 dialect files. No behavioral change apart from the normalized rejection message: the per-dialect SQL-output tests pass unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
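For readers without the diff at hand, the shape of the `RegexSafety` consolidation can be sketched roughly as below. This is an illustration only, not the merged code: the class name `RegexSafetySketch`, the limit values, the two detector regexes, the `IllegalArgumentException` type, and the method signatures are all assumptions. Only the method names, the constant names, and the rejection message come from the PR. The detectors are deliberately simplified (they ignore character classes and groups nested inside the quantified group).

```java
import java.util.regex.Pattern;

// Hypothetical sketch: limit values, detector regexes, exception type, and
// signatures are assumptions for illustration, not the code merged in this PR.
final class RegexSafetySketch {
    static final int MAX_PATTERN_LENGTH = 500; // assumed value
    static final int MAX_GROUPS = 20;          // assumed value
    static final int MAX_NESTING_DEPTH = 10;   // assumed value

    // Single rejection message shared by every dialect.
    static final String REJECTION = "Invalid pattern in expression";

    // Simplified detectors: a quantified group whose body contains a
    // quantifier, e.g. (a+)+, or an alternation, e.g. (a|b)*.
    private static final Pattern NESTED_QUANTIFIER =
            Pattern.compile("\\([^()]*[+*][^()]*\\)[+*{]");
    private static final Pattern QUANTIFIED_ALTERNATION =
            Pattern.compile("\\([^()]*\\|[^()]*\\)[+*{]");

    private RegexSafetySketch() {}

    // Rejects patterns longer than the shared length limit.
    static void checkLength(String pattern) {
        if (pattern.length() > MAX_PATTERN_LENGTH) {
            throw new IllegalArgumentException(REJECTION);
        }
    }

    // Rejects patterns that exceed the group/nesting limits or trip a detector.
    static void checkReDoS(String pattern) {
        if (countUnescapedParens(pattern) > MAX_GROUPS
                || computeMaxNestingDepth(pattern) > MAX_NESTING_DEPTH
                || NESTED_QUANTIFIER.matcher(pattern).find()
                || QUANTIFIED_ALTERNATION.matcher(pattern).find()) {
            throw new IllegalArgumentException(REJECTION);
        }
    }

    // Counts '(' characters that are not escaped by a preceding backslash.
    static int countUnescapedParens(String pattern) {
        int count = 0;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '(') {
                count++;
            }
        }
        return count;
    }

    // Returns the deepest level of unescaped parenthesis nesting.
    static int computeMaxNestingDepth(String pattern) {
        int depth = 0;
        int max = 0;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '(') {
                depth++;
                max = Math.max(max, depth);
            } else if (c == ')' && depth > 0) {
                depth--;
            }
        }
        return max;
    }
}

// Hypothetical per-dialect class: keeps its own entry point and delegates.
final class PostgresRegexSketch {
    static String validate(String pattern) {
        RegexSafetySketch.checkLength(pattern);
        RegexSafetySketch.checkReDoS(pattern);
        return pattern;
    }
}
```

Under these assumptions, calling `checkLength` before `checkReDoS` bounds the input that the detector regexes scan, and every dialect that delegates reports the same message on rejection.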
Summary
Removes duplicated code across the dialect layer that was surfaced by analyzing the repo with the `file-search-on` tools (`find_duplicate_functions`, `complexity`, `code_graph`). Two clusters were duplicated across the per-dialect classes; both are now consolidated into focused helpers in the `dialect` package, with each dialect keeping an explicit override that delegates.

1. `RegexSafety`: shared ReDoS validation (CWE-1333)

All five RE2-style dialects (`PostgresRegex`, `MySqlRegex`, `DuckDbRegex`, `BigQueryRegex`, `SparkRegex`) carried a byte-identical copy of:

- the `MAX_PATTERN_LENGTH` / `MAX_GROUPS` / `MAX_NESTING_DEPTH` limits and the two detector patterns;
- `validateNoNestedQuantifiers` (the repo's single most complex function, repeated 5×), `countUnescapedParens`, `computeMaxNestingDepth`.

Now exposed as `RegexSafety.checkLength(...)` + `RegexSafety.checkReDoS(...)`.

Bonus fix: the same nested-quantifier rejection previously threw `"Invalid pattern in expression"` in Postgres/DuckDB/Spark but `"Invalid regex pattern"` in MySQL/BigQuery. All dialects now emit the former consistently.

2. `SqlEmitters`: shared SQL-fragment emitters

The six `*Dialect` classes duplicated several emitter bodies. Consolidated into parameterized helpers:

| Shared helper | Dialect method(s) that now delegate |
| --- | --- |
| `writeBinaryCall` | `writeSplit` |
| `writeArrayJoin` | `writeJoin` |
| `writeJsonEachMembership` | `writeJSONArrayMembership`, `writeNestedJSONArrayMembership` |
| `writeJsonPathProbe` | `writeJSONExtractPath` |
| `writeInfixRegex` | `writeRegexMatch` |
| `writeStandardExtract` / `writeExtractWithPostgresDow` | `writeExtract` |
| `writeArrowJsonAccess` | `writeJSONFieldAccess` |

Per-dialect field-name escaping is threaded through as a method reference, so BigQuery's `\'` escaping is preserved alongside the `''` used elsewhere. Dialects whose output genuinely differs (e.g. SQLite's unsupported-op throws, MySQL's `JSON_UNQUOTE` join) keep their inline implementation.
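The "override that delegates, with escaping passed as a method reference" pattern can be sketched as follows. Everything here is illustrative: the `DialectSketch` interface, the two dialect classes, the parameter lists, and the exact SQL shape (`base->>'field'`, `FN(a, b)`) are assumptions. Only the helper names `writeArrowJsonAccess` and `writeBinaryCall`, the dialect method name `writeJSONFieldAccess`, and the two escaping styles (`''` and `\'`) come from the PR. The sketch does not claim which real dialects call which helper.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of the shared emitters; signatures and SQL shape are assumed.
final class SqlEmittersSketch {
    private SqlEmittersSketch() {}

    // Emits base->>'field', escaping the field name with the caller's escaper.
    static void writeArrowJsonAccess(StringBuilder out, String base, String field,
                                     UnaryOperator<String> escapeField) {
        out.append(base).append("->>'").append(escapeField.apply(field)).append("'");
    }

    // Emits FN(left, right) for two already-rendered operands.
    static void writeBinaryCall(StringBuilder out, String fn, String left, String right) {
        out.append(fn).append('(').append(left).append(", ").append(right).append(')');
    }
}

// Hypothetical dialect contract: each dialect owns its escaping rule.
interface DialectSketch {
    String escapeFieldName(String name);

    void writeJSONFieldAccess(StringBuilder out, String base, String field);
}

// Stand-in for a dialect that doubles single quotes ('').
final class QuoteDoublingDialect implements DialectSketch {
    @Override
    public String escapeFieldName(String name) {
        return name.replace("'", "''");
    }

    @Override
    public void writeJSONFieldAccess(StringBuilder out, String base, String field) {
        // The override stays explicit; only the body is delegated.
        SqlEmittersSketch.writeArrowJsonAccess(out, base, field, this::escapeFieldName);
    }
}

// Stand-in for a dialect that backslash-escapes single quotes (\').
final class BackslashDialect implements DialectSketch {
    @Override
    public String escapeFieldName(String name) {
        return name.replace("'", "\\'");
    }

    @Override
    public void writeJSONFieldAccess(StringBuilder out, String base, String field) {
        SqlEmittersSketch.writeArrowJsonAccess(out, base, field, this::escapeFieldName);
    }
}
```

Passing `this::escapeFieldName` keeps the shared emitter unaware of any dialect's quoting rules, which is one plausible way the differing escapes can survive a consolidation like this one.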
Impact

`find_duplicate_functions` now reports zero duplicate groups among these methods at the standard threshold.

Testing

`./gradlew test` passes. The conversion tests assert exact SQL output per dialect, so behavior preservation is directly verified.

🤖 Generated with Claude Code